Web Translation Mining Based on Suffix Arrays
Abstract
Mining translations from abundant Web data has applications in many fields, such as computer-assisted learning, machine translation, and cross-language information retrieval. The challenging issues are how to mine possible translations from the Web and determine the boundaries of candidates, and how to remove irrelevant noise and rank the candidates. In this paper, after reviewing and analyzing the possible methods of acquiring translations, we propose a statistical method based on suffix arrays to mine term translations from the Web. The proposed method can not only mine the different forms in which translations are distributed on the Web but also effectively obtain the correct boundaries of translations. Sort-based subset deletion and mutual information methods are then proposed to deal, respectively, with the subset redundancy and the affix redundancy that arise during estimation. Experiments on two test sets of 401 English-Chinese terms and 100 English-Japanese terms show that our system performs well.
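The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' system. It assumes that search-result snippets for a source term are joined into one string, that a naively built suffix array is used to count repeated substrings as translation candidates, and that a candidate is dropped when a longer kept candidate with the same frequency contains it (one plausible reading of "sort-based subset deletion"). The function names, the thresholds, and the toy snippets are all hypothetical, and the mutual information step is omitted.

```python
from itertools import groupby


def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting the suffixes directly; adequate for a
    handful of snippets, too slow for large corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def repeated_substrings(text, min_len=2, max_len=6, min_freq=2):
    """Count substrings of length min_len..max_len that occur at least
    min_freq times, skipping any that contain whitespace.

    Suffixes that share a prefix are adjacent in the suffix array, so
    grouping neighbours by their first n characters gives the frequency
    of every length-n substring.
    """
    sa = build_suffix_array(text)
    counts = {}
    for n in range(min_len, max_len + 1):
        starts = [i for i in sa if i + n <= len(text)]
        for key, group in groupby(starts, key=lambda i: text[i:i + n]):
            if any(ch.isspace() for ch in key):
                continue  # candidate would span a word or snippet boundary
            freq = sum(1 for _ in group)
            if freq >= min_freq:
                counts[key] = freq
    return counts


def delete_subsets(counts):
    """Assumed form of subset deletion: visit candidates longest first and
    drop one when a kept candidate with equal frequency contains it."""
    kept = {}
    for cand in sorted(counts, key=len, reverse=True):
        redundant = any(
            cand in longer and counts[cand] == freq
            for longer, freq in kept.items()
        )
        if not redundant:
            kept[cand] = counts[cand]
    return kept


# Hypothetical snippets returned for the English term "machine translation".
snippets = ["机器翻译 系统", "机器翻译 方法", "统计 机器翻译"]
text = "\n".join(snippets)
result = delete_subsets(repeated_substrings(text))
print(result)  # {'机器翻译': 3}
```

In this toy input every fragment of 机器翻译 (such as 机器 or 翻译) occurs exactly as often as the full string, so only the full string survives, which illustrates how frequency statistics can recover a candidate's boundary without a word segmenter.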
Similar resources
Web-Based Terminology Translation Mining
Mining terminology translations from a large amount of Web data can be applied in many fields, such as reading/writing assistance, machine translation, and cross-language information retrieval. How to find more comprehensive results from the Web and obtain the boundaries of candidate translations, and how to remove irrelevant noise and rank the remaining candidates are the challenging issues. In this...
Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages
This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The issue is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...
Distributed text search using suffix arrays
Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, wh...
Hierarchical Phrase-Based Translation with Suffix Arrays
A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly look up and extract rules on the fly. Hierarchical phrase-based translation introduces the added wrink...
Efficient Discovery of Proximity Patterns with Suffix Arrays
We describe an efficient implementation of a text mining algorithm for discovering a class of simple string patterns. With an index structure for pattern discovery, called the virtual suffix tree, built on top of the suffix array, the resulting algorithm is simple and fast in practice compared with the previous implementation based on the suffix tree.
Journal title:
- Journal of Chinese Language and Computing
Volume 17
Publication date: 2007